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Abstract 



In Hybrid Monte Carlo simulations for full QCD, the gauge fields evolve smoothly 
as a function of Molecular Dynamics time. Here we investigate improved methods of 
estimating the trial or starting solutions for the Dirac matrix inversion as superpositions 
of a chronological sequence of solutions in the recent past. By taking as the trial solution 
the vector which minimizes the residual in the linear space spanned by the past solutions, 
the number of conjugate gradient iterations per unit MD time is decreased by at least 
a factor of 2. Extensions of this basic approach to precondition the conjugate gradient 
iterations are also discussed. 



I. Introduction 

At present, typical methods for going beyond the quenched (or valence) approximation 
for lattice QCD involve an increase of several orders of magnitude in computational time - 
severely limiting statistics on large lattices. Consequently, finding a more efficient approach 
to the inclusion of internal fermion loops poses a major challenge for the next generation of 
lattice simulations 

The most successful approach to date for generating full QCD configurations is the 
so-called Hybrid Monte Carlo (HMC) algorithm 0], which uses Molecular Dynamic (MD) 
evolution in a "fifth time" coordinate t. At each time step, the Dirac matrix must be inverted 
to calculate the force due to fermion loops. This inversion is the most time-consuming part 
of the full QCD algorithm. The Dirac operator is represented as a 12V by 12V sparse square 
matrix, where the number of space-time lattice sites (or volume V) is on the order of 10 5 ~ 6 . 
With the lattice volumes used in recent simulations, the Dirac matrix may account for 90% 
or more of the computer's time. Moreover, due to the Dirac inverter, the problem scales as 
(l/a) 6 ~ 7 . Therefore any acceleration of the Dirac inversion implies a major improvement in 
the overall performance of generating QCD configurations. 

Our simple observation is that by virtue of the system's smooth evolution in MD time, 
it should be possible to use information from the recent past to perform the next inverse more 
efficiently f|]. 

Our present approach to make use of the past data is straightforward pi]. Since the 
conjugate gradient is an iterative procedure, a trial starting configuration is needed. Our 
method is to "extrapolate" from a set of the most recent solutions of the Dirac inversion in 
order to find a better trial solution for the next inversion. This idea is legitimized by the 
observation, that if the inverse is solved exactly, detail balance is preserved, regardless of the 
starting trial solution for the conjugate gradient. Therefore we do not propose to modify 
the QCD algorithm, but only to provide a prescription for better selecting the starting trial 
solution for the Dirac inverter. 

This method, like the interesting suggestion of Liischer ||, relies on the observation that 
the new generation of supercomputers often has very large memories and that the efficient use 
of the entire resource is no longer simply the optimization of the CPU but also of its memory 
and its communication systems. In particular, improved algorithms based on the exploitation 
of a large memory are to be expected, since most QCD algorithms have been designed at a 
time when memory was at a premium. 

Several possibilities can be taken into account. The first obvious step, employed fre- 
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quently in HMC codes, is simply to use the previous solution as the starting trial solution. 
Some groups have even used a linear extrapolation of the last two solutions 0, [?]]. The next 
natural step is to use a high order polynomial extrapolation We analyzed the polynomial 
extrapolation up to the sixth order and we have observed that it gives a robust speed up 
in the CG (see section V). Motivated by this success, we then considered the possibility of 
fixing the trial vector as a linear combination of the N previous solutions, by minimizing the 
residual in the norm of the matrix (see section IV). This approach, which we call the Minimal 
Residual Extrapolation (MRE) method, further improved the performance. We also explored 
other possibilities, such as the Kalman filter [^], however, in our simulations, none of these 
methods offers better performance than the Minimal Residual Extrapolation method. 

Of course, as discussed in the conclusions, there may well be more sophisticated ways 
to use past information, which will lead to a still more efficient algorithm. For example, it 
appears to us that a sensible strategy is to exploit the past Krylov spaces to build preconditions 
for the conjugate gradient iteration itself. As an illustration, we present a modest step in 
this direction which builds a preconditioner from the same vectors used for the MRE trial 
solution (see Section VII.). We gain an additional small improvement in performance, which 
encourages us to direct our future research toward chronological preconditioners. 

It is also noteworthy that this generic problem of a chronological sequence of matrix 
inversions is not unique to HMC for QCD. For example, in many fluid dynamics codes, one 
must solve Poisson's equation for the pressure field as an inner loop for integrating the Navier 
Stokes equation || . Consequently, there will be other circumstances for developing and testing 
these algorithms. 



II. Hybrid Monte Carlo for QCD 



In this section, a brief review of the Hybrid Monte Carlo (HMC) algorithm for full 
QCD is presented. This algorithm has the advantage that, apart from numerical round off, it 
provides an exact Markov process for generating QCD configurations. The starting point is 
the Euclidean partition function for QCD, 

Z QCD = J D^DUe~ S ^ U ) + ^ M ^ , (1) 

where the gauge variables are represented by unitary matrices Uu(x) = J"^-y-\ x ) and the 
Fermions by Grassmann variables ip x - assigned to links and sites respectively. S g (U) is the 
pure gauge action and M(U) is the Dirac matrix, 

M x>y = S Xt y - k(1 - ^ fl )U^(x)5 x+ ^ y - k(1 + 7 M )f7T(x)4, a + iU , (2) 

where k is the hopping parameter. 
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We restrict our attention to 2 flavors of Wilson Fermions, so that integrating out the 
Grassmann variables yields the square of the Dirac determinant. Consequently, with the 
identity 



det(M) 



nf 



det (M f M) 



nf/2 



(3) 



we may rewrite the determinant as integral over a positive definite Gaussian in terms of 
a set of bosonic "pseudo-fermion" fields (p x . The HMC algorithm also requires additional 
canonical (angular) momenta coordinates, En{x), conjugate to the gauge fields A^{x) on each 
link (x,x + fi). Combining these steps the QCD partition function is now rewritten as 



■>QCD 



J DU DE Dip e -H(U,E,<p) ^ 



with an entirely bosonic action given by 

H(U,E,ip) = \Tr E 2 + S g (U) + <p+[Aft M]~V • 



(4) 



(5) 



The standard HMC algorithm uses alternates Gaussian updates with Hamiltonian evolution to 
achieve detailed balance and ergodicity: At the beginning of each Hamiltonian trajectory the 
momenta, E, and "pseudo-fermion" fields, b = ip, are chosen as independent Gaussian 

random variables. Next the gauge fields are evolved for MD time T using the Hamiltonian 
equations of motion for fixed values of the "pseudo-fermion" fields (p. It is easy to prove that 
this results in Markov process for the gauge fields U which leaves probability distribution, 



V{U) 



DipDipe °9 



Sg(U)+^M(U)^ 



(6) 



invariant. 



In practice, the Hamiltonian evolution must be approximate by an integration scheme 
with finite step size St. Consequently, at the end of each MD trajectory, a Metropolis accept- 
reject test is introduced to remove integration errors which would otherwise corrupt the action 
being simulated. In addition, it is also essential that the MD integration method exactly 
obey time reversal invariance to avoid violating detailed balance. Clearly small violations 
of detailed balance are inevitable due to round off errors. Current practice usually employs 
the leapfrog integration scheme in single precision arithmetic. Also, the standard odd-even 
preconditioning 0, 10 1 , has been implemented by the replacement, 



M = 1 — K K eo - KK oe -> M ee = 1 - K 2 K eo K oe . (7) 
In this case, only pseudo-Fermions tp on the even sites have to be introduced and the matrix, 

A = M\ e M ee , (8) 

is used in the algorithm. 
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There are many small variations on this basic method, but common to all HMC algo- 
rithms is the need to accurately integrate the equations of motion, calculating the force on U 
due to the pseudo-Fermions at each time step t n . The force of the fermion loops on the gauge 
field requires solving the Dirac equation, 

A(t n ) x (*„) = <p , (9) 

over and over again for the propagator x(t), where A(t) = M(U)^M{U). Technically, this 
is achieved by starting with a trial value \triai 

and iteratively solving Equation @ for x{t)- 
The operator A(t) changes smoothly as a function of the MD time t, as new values of U 
are generated. During the MD steps for which ip is held fixed, the solution must also 
change smoothly in time t. The solution of @, which is usually obtained using the conjugate 
gradient (CG) method, is the most computationally expensive part of Hybrid Monte Carlo 
algorithm. Therefore improvements in this part will result in nearly a proportional net gain 
in the performance of the full QCD simulations. 



III. Simulations 



We have investigated the performance of our methods using the MIT-BU collaboration 
code for the CM-5. We have available two CM-5, one with 128 nodes and the other with 64 
nodes. We used the configurations generated for the MIT-BU lattice collaboration. We tested 
various values of the molecular dynamics step St for 5 or more independent thermalized full 
QCD configurations, separated by approximately 100 MD trajectories. 



In our tests we performed extensive simulations on a 16 lattice at /3 = 5.5 and k = 
0.160. We choose St in a typical window for actual simulations (0.002 < St < 0.015). Some 
simulations were performed with a lighter quark mass and we obtained similar results 0], 
although the magnitude of the improvement depended slightly on the lattice parameters. 



We used a standard CG method, and we iterate until the normalized squared residual, 

\x 



R = ^ ee "T* » (10) 



reaches a given value. We choose the stopping condition, in much the same spirit as refer- 
ence J?]], so that the error in computing the AS, used in the Metropolis accept-reject step, 
should on average be less than 1%. However it is not really known how this error propagates 
to physical quantities. 



A critical issue is the requirement to converge accurately to the final solution for \. 
Unless you start from a value of x which is independent of past values (e.g. x = 99), the initial 
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guess introduces an element of non-reversible dynamics and therefore a failure of detailed 
balance. Even in one of the most commonly used HMC methods, starting from the last value 
of x (° r a constant extrapolation) , failure to converge to the correct inverse introduces a bias 
that breaks time reversal invariance. In principle, only if the CG has converged exactly to 
the fixed point of the conjugate gradient iterations, there will be no violation of time reversal 
invariance. In practice, we have measured this source of error by explicitly reversing the MD 
dynamics for various values of the stopping conditions (see Eq. [h]) . In Figures we show 
examples for R = 1(T 10 , R = 10 -11 , R = 1(T 12 , R = 10~ 13 , R = 10" 14 and R = 1(T 15 . In 
Figure |l| we can see the action variation (total, fermion, gauge and momenta) in a trajectory 
that goes forward in MD time 100 steps and then is reversed for 100 steps. We have subtracted 
the initial action, so if the dynamics conserves energy, the total action should remain zero 
and if the dynamics is reversible, the total action should be symmetric around the MD time 
t = 100. 



Our data demonstrates that simulations with R > 10 are not sufficient to ensure 
a reversible dynamics. In Figure we plot the total action difference between symmetric 
points in a forward-backward trajectory. If the dynamics is reversible, this difference should 
be zero. In this figure we can see that for R > 10~ 13 the dynamics is not reversible. For 
R = 10~ 13 the action difference observed is of order of 10 -2 where the total action is of order 
10 7 , which means that we are at the boundaries of single precision accuracy. However, for 
R = 10~ 14 we are well bellow the single precision accuracy. Smaller action variations can not 
be detected. Consequently, we chose to run our simulation using R = 10~ 14 as the conjugate 
gradient stopping criterion. 



We measured the total number of CG steps needed to invert the Dirac Matrix to a 
given precision. The CPU time for a single leap-frog trajectory is practically proportional to 
the number of CG steps. However, the true performance of the method should be judged by 
the total number of CG steps needed to evolve the system for a fixed MD time, i.e. the total 
number of CG steps needed to compute a trajectory of time length T. We define the quantity, 

cr=^p, (ii) 

which is proportional to the "computational time". Moreover, since a smaller step size will 
improve the acceptance rate, the trade-off may be even more favorable for small 5t, and this 
effect should also be included in an overall measure of efficiency. However, because we always 
worked with 5t in a region of very good acceptance rates (> 99%), we have not included this 
effect in our estimate of overall efficiency in our simulations. 



IV. Chronological Inverter for QCD 
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We began our investigation by considering a trial solution, x(0), of the new Dirac matrix 
A(0), as a linear superposition of old solutions, 

Xtriai = ci + c 2 X(h) H h x(*Jv)- (12) 

To simplify our notation, we will always suppose that the new inverse is computed at t = 0, 
given the past values at tj, with • • • < £3 < ti < t\ < 0. In practice, this is usually a regular 
series of values t n = —n 5t with an integration step 5t. 



To date HMC simulations have used either the past solution as a trial solution, Xtriai = 
x(ti), or a linear H, [?J extrapolation, Xtriai = 2x(ii) — x(*2)- It is natural to try to improve 
the estimate for the trial value by using higher order polynomial extrapolations. For example, 
if one uses an N-th order polynomial to fit N+l past values, the coefficients are given by 

* - W^W- ■ (13) 

Although substantial improvements can be made using higher order polynomial extrapola- 
tions, we found that the method breaks down for large N. We have also considered using over 
constrained fits and some estimators based on a Kalman filter. But we soon discovered that 
a more appealing approach to the extrapolation problem. 



Since the conjugate gradient method is in fact just a minimal residual technique con- 
fined to the Krylov []Tl]] subspace spanned by vectors A^^xtriah wn Y n °t start by examining 
a "smarter" subspace based on past success for nearby times? In this spirit, we suggest 
determining the coefficients c n by minimizing the functional, 

*M = X ] M^M X - rfx ~ XV , (14) 

which is the same functional minimized by the Conjugated Gradient method itself. This 
corresponds to the minimization of the norm of the residual in the norm of the inverse matrix, 

^Wm T = X f M^Mx-^X-X j f + ^b, (15) 

in the subspace spanned by Xi — x(^«)> w h ere t = <j> — M'Mx and b = (M^) -1 ^. Therefore 
we call this method "Chronological Inverter by Minimal Residual Extrapolation" or MRE. 
The minimization condition reduces to 

2V 

J2 xfMiMxj Cj = X »V • (16) 
i=i 

The only technical problem is that this system can be poorly conditioned because the past 
solutions Xi differ from each other by order 5t. However, if properly handled, this instability 
should not affect the resulting minima of the quadratic form in the span of the vectors, Xi- 
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We have explored other methods of extrapolations to some extent, in particular other slightly 
different definitions of residue (not in the norm of the inverse matrix) were examined, but 
nothing appears at present to out-perform the Minimal Residual Extrapolation method. 



V. Chronological Inverter by Minimal Residual Extrapolation 

We implemented this method by calling a routine that minimizes the residual in the 
subspace of the past solutions, right before each CG in the usual leapfrog. This routine has 
as inputs the set of past vectors Xi> i = and returns a vector Xtrial that minimizes VPfx] 

in the span(xi)'- 

MRE Algorithm 

• Construct an orthogonal basis Vi in the span(xi) using the Gram-Schmidt procedure. 

• Form the sub-matrix G nm = v n < Mv m , and the vector b n = v n 'tp. 

• Solve G nm a m = b n using the Gauss-Jordan method. 

• Return the trial vector Xtrial = J2n=i a nV n - 

Because x«' s are nearly linearly dependent there is a numerical instability, and the Gram- 
Schmidt procedure does not produce a true basis of span(xi) due to round off error. Hoverer 
span(vi) is a subspace close enough to the relevant subspace of the old solutions. Because 
the accumulated error is larger at the late stages of the Gram-Schmitt iteration, the ordering 
of the Xi is important. We get better results if we number the Xi's from the newest to the 
oldest. This way we quite accurately pick up the new relevant directions, while the old ones, 
which are less relevant, are computed with more round off error. In this way we can handle 
the numeric instability and turn it to our advantage. 

This method requires (N 2 + 5N)/2 dot products and N Mx matrix- vector applications, 
and the storage of 3N past pseudo-fermion configurations. There are other implementations 
of the above concepts which can do the same thing using less memory. But most of them 
suffer from numerical instabilities. The best we found, and actually used in our production 
runs, is instead of storing store v's, keeping them in the right order - most recent first. 
This reduces the memory requirement to the storage of 2N pseudo-fermion configurations, 
and has the same performance as the original method. We checked this method using double 
precision, and there was no visible improvement by reducing the round off-error. 

In Table |l|, we record the mean number of CG steps required to reach the solution 
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normalized relative to starting with Xtrial = '-P- We used the stopping condition that the 
normalized squared residual ( |l0|) was smaller than 10 -14 . 

The data (see Table [j] and Figures |3|, |] ) , suggest that the more vectors you keep from 
the past the better starting residue you achieve, and as a result the fewer CG steps you need 
in order to converge to a given accuracy. Furthermore the number of CG steps is decreasing 
as St is decreasing. We have also presented our results as a contour plot in Figure || for the 
number of iterations as a function of the number of vectors versus the time step. 
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Table 1: Number of CG steps needed to converge to the solution from minimum residual 
extrapolation. The table is normalized with respect to N = (i.e. no extrapolation, Xtrial = f)- 
The statistical errors are of the order of 10%. 



As explained above, the performance of the method given in "computational time" CT 
is summarized in Table ||, and in Figures ^J^. It is interesting to note how CT depends on 
the number of the vectors retained. Unlike the number of CG steps, CT does not decrease 
as 5t becomes smaller. Moreover, one can see from Figure that there is a range of 5t where 
the performance is relatively insensitive to 5t. As you decrease the step size, you gained back 
almost the same performance by reducing proportionally the number of CG steps required for 
convergence in each step. As discussed earlier, smaller step sizes have the extra advantage of 
improving the acceptance rate. 
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Table 2: Computational Time ( CT) starting from minimum residual extrapolation. The table 
is normalized to 1 at 5t = 0.010 for N = 1 ( Xtrial = Xlast)- 



VI. Comparison with the Polynomial Extrapolation Method 



The polynomial extrapolation method has the advantage that it requires very little 
computational effort, just a local sum on each lattice point with fixed coefficients given once 
and for all by equation (|l3|), costing less than a single CG step. For a polynomial of order N, 
the only storage requirement is for the previous N pseudo-fermion configurations, x(U)- 



Results of the polynomial extrapolation are shown in Tables pLJ| and Figures |8|,p|JlO|,ll, 
12. Again, to compare efficiencies, the CG iterations should be divided by St, so we compare 
the total number of CG steps needed to evolve the system for a fixed distance in configuration 
space. Note that if St is large, the overall performance is good, but the acceptance will drop 
drastically. If St is small, the extrapolation is excellent, but the system will evolve too slowly 
in phase space. It is noteworthy that we find a window in St where both the acceptance is 
good and the extrapolation works well. 



We also note that the coefficients of the Minimal Residual Extrapolation method (12), 
which a priori are generic complex numbers, are in fact very close to coefficients in the poly- 
nomial extrapolation ( Jl3| ) for the first few orders. One way to understand this coincidence is 
to observe that for a smooth evolution the determination of coefficients a, 



XtHai = ci x(h) + c 2 x(h) H h c N x(*Jv), (17) 
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Table 3: Number of CG steps to reach the solution for polynomial extrapolation. Normalized 
to 1 for N = (xtrial = <f)- 



by a polynomial fit is equivalent to fixing the coefficients by making a Taylor expansion of 
each term, x(^n)> in t n = nSt canceling all contributions to 0(5t N ). To prove this we solve 
the constraints 

N 

J2( t iT~ 1 yn = x(t i ), (is) 

n=l 

for a polynomial fit y(t) = yx +t y% + ■ ■ ■ + i^ -1 y^ to find y\ = Xtrial- Then show that 
yi = J2i c iX(U) when we enforce the Taylor series constraints, 

N 

X^r- 1 ^ = s 1>n . (19) 
i=i 



From this exercise, we conclude that the success of the polynomial fit probably results 
from the local convergence of the power series MD in time. On the other hand, if we compare 
the magnitude of the estimated residual of the Minimal Residual Extrapolation (MRE) versus 
the Polynomial Extrapolation (see Figures ||, ^), we notice that while the polynomial actually 
deteriorates at high order, the minimal residual continues to improve. In our simulation for a 
polynomial extrapolation of order of N = 5 — 6, the deterioration becomes appreciable. 

In conclusion the increased efficiency of the MRE method is worth the extra computa- 
tional effort (equivalent to N/2 CG steps). With a Polynomial Extrapolation, one gets poor 
results on a few exceptional lattices, whereas the MRE method is a more robust estimator, 
which implements a kind of self-tuned extrapolation, never doing worse than a polynomial fit 
of the same order. 
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Table 4: Computational Time ( CT) starting from polynomial extrapolation. The table is 
normalized to 1 at St = 0.010 for N = 1 (xtrial = Xlast)- 



VII. Projective Conjugate Gradient 

It is natural at this point to try to understand why the Chronological Inverter is working 
and to explore more effective methods of using the past information. We have performed a 
large number of "numerical experiments" to gain some understanding. In particular we have 
shown that if one assumes complete knowledge of the inverse of the matrix in the immediate 
past A% = A(ti) and performs a perturbation expansion in the difference V = A — A\ con- 
vergence is poor unless St is very small on the order of 10~ 3 . On the other hand, although 
we know no way to efficiently implement the idea, if we use A\ as a preconditioner for the 
conjugate gradient iterations of A, a spectacular acceleration of the conjugate gradient is 
achieved — for St = 0.01 converging to R = 10 -16 in about 15 iterations. In this way, we 
are now convinced that the use of past information as a preconditioner holds out even greater 
promise. We are engaged in on going research in this regard. However, here we report a minor 
extension of the current method along these lines. 

In the Minimal Residual Extrapolation, we suggest re- using our past solutions {xi} to 
construct a preconditioner for the subsequent CG iterations. The straight forward modifi- 
cation is as follows. Since we have already minimized the functional Vl/(x) in the subspace 
spanned by Xi s i it is reasonable to require that all the subsequent search directions we gen- 
erate in CG are also ^-conjugate to this subspace. This will ensure that at any step K of 
CG, we obtain the global minimum of \P(x) in the space spanned by Xi' s an d p^ (k = 1..K). 
This requirement is implemented by almost the same technique used in conventional precon- 
ditioners for CG [pTf . In each CG iteration, the residual r*. is replaced by an improved search 
direction vector z^. Thus we propose for our Projected Conjugate Gradient (PCG) method 
that the vector z is chosen to be the A orthogonal component of r with respect to span(xi) or 
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in terms of the basis vectors basis v n denned above for our MRE algorithm (see Section V) 

z = r-X n v n , (20) 

where the condition v^Az = means 

G nm \ m = vlM^Mr (21) 

with A = M^M and G nm = v n ^ Mv m defined as before. The coefficients \ n are found by 
solving a small linear system using the Gauss-Jordan elimination method or the LU factor- 
ization. The amount of computation for solving such a system is very small, especially if one 
uses the LU factorization, since the matrix G nm is computed once and the LU factors can be 
used in subsequent steps. As a result we modify the conjugate gradient routine as follows: 



PCG Algorithm 



• Project residue: z k = r k - w n (G _1 ) ni?n ('U^M t Mr fc ) 

• Compute search direction: p k = Zk-i - Pk-iPk-u where f3 k -i = r\,_ 1 Zk-i/r\._ 2 Zk-2- 

• Compute new solution: Xk = Xk-l + &kPk, where a k = r\._ l Zk-.i/p\M^Mp k 

• Compute new residual: r k = r k _\ — M^Mp k . 



It can easily be shown that the above algorithm produces a set of search vectors that are A- 
conjugate among themselves and to the Uj's. The only time consuming step, needed every CG 
iteration, is the computation of the right-hand side v n ^ Mr k , but the vectors v n = M^Mv n 
have in fact already been computed in the MRE step so this only involves N scalar products. 
For our implementation on the MIT 128-node CM-5, one M^M\ operation is equivalent to 55 
v'r operations so with O(10) vectors the overhead is about 20%. This cost has been taken into 
account when analyzing the performance of this algorithm. The results of our performance 
tests are summarized in Tables |5|,0 and in Figures 0,14,15,16,17. 



Our simulations show that the Minimal Residual Extrapolation algorithm followed by 
the PCG algorithm gives us an extra 30% improvement over the Minimal Residual Extrap- 
olation followed by the standard CG. In general the actual improvement of the performance 
depends crucially on the dot product versus Matrix- Vector product speed that a given com- 
puter can achieve. 
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N=3 
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0.47 


0.55 
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0.74 


N=4 


0.14 
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0.51 


0.55 
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N=5 
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0.42 
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0.09 


0.18 
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0.30 


0.44 


N=10 


0.09 


0.18 


0.22 


0.24 


0.27 


0.28 


0.40 


N=ll 


0.09 


0.16 


0.21 


0.23 


0.25 


0.27 


0.39 


6t 


0.002 


0.005 


0.007 


0.008 


0.009 


0.010 


0.015 



Table 5: Number of the projective CG steps needed to converge to the solution. We used 
Minimal Residual Extrapolation to obtain the initial trial solution x- The table is normalized 
with respect to N = (i.e. no extrapolation, Xtrial = The statistical errors are of the 
order of 10%. 



VIII. Conclusions 



Let us summarize our perspective on the problem of accelerating a time sequence of 
conjugate gradient inverters. The chronological method has offered significant performance 
improvement, but at the same time the results are tantalizing. Just by starting from the old 
solution, the residue is reduced by 4 orders of magnitude to order 10 -4 , relative to the residue 
with x = ¥?• Then if we look at Figure 15, we see that the residue is reduce further by 5 orders 
of magnitude using 10 additional past solution vectors in our Minimal Residual Extrapolation 
method. Finally, the first 10 CG iterations accounts for another 2 orders of magnitude. 
However accurate reversibility ultimately requires us to reduce the residue by another 4 more 
orders. This last 4 orders of magnitude takes several hundred additional vectors in the CG 
iterations. We are both intrigued and frustrated by the observation that we can reduce the 
residue by 11 orders of magnitude in a 20-30 dimensional vector space, but then the standard 
CG iterations requires hundreds of additional search directions to accomplish the remaining 
4 orders of magnitude needed to satisfy adequately the reversibility constraint. It is tempting 
to hope that further improvements can be made on this last 4 orders of magnitude. 



We have emphasized the analytic properties of x(t) because we believe it may suggest 
ways to understand and further improve the chronological method. For example, the failure 
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Table 6: Computational Time ( CT) for the minimum residual - projective CG algorithm. The 
table is normalized to 1 at St = 0.010 for N = 1 ( Xtrial = Xlast)- 



at fixed St to improve the polynomial extrapolation by increasing indefinitely the number of 
terms is probably a signal of nearby singularities. Our success so far is probably due to the slow 
evolution of the low eigenvalues of the Dirac operator. Thus we are in essence taking advantage 
of "critical slowing down" in the HMC algorithm to accelerate the Dirac inverter. However, 
there may well be other vectors (besides the final solution) in the nearby past iteration that 
can better exploit the slow evolution of our matrix. For example, the last CG routine, which 
is closest in time to our present inverter, itself generates many A conjugate search vectors, 
that may be more useful than the older solutions exploited in our MRE method. In addition 
this subspace of past vectors might be used not just as a way to arrive at an initial guess, but 
also as a way to accelerate (or "precondition") the iterative process itself. In this spirit the 
Projective CG method presented in section VII was included to illustrate how in principle a 
set of past vectors can be exploited to accelerate the convergence of the CG algorithm itself. 
We are currently studying such chronological preconditioners more systematically. 

In conclusion, we found that the procedures presented here, and in particular the Min- 
imal Residual Extrapolation method reduces by a factor of about 2-3 the mean number of 
conjugate gradient iterations required to move in phase space for a fix molecular dynamics 
time. This is achieved with negligible extra computational time, at the expense of memory. 
Consequently when sufficient memory is available for storing the past solutions the Dirac in- 
verter, our Chronological Inversion method certainly provides one more useful trick for more 
efficiently generating full QCD configurations. 
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Figure 1: Action variation in a forward-backward trajectory. The initial action has been 
subtracted. R < 1CT 10 was used as a stopping criterion and St = 0.010. 
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Figure 2: Reversibility violations for the energy differences in a forward-backward trajectory 
with the stopping conditions R = KT 10 , 10 -11 , 1(T 12 (a) and R = 1(T 13 , KT 14 , 1(T 15 (b). 
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Figure 3: Number of CG steps vs. number of vectors for various values of 5t for the Minimal 
Residual Extrapolation method. 
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Figure 4: Starting residue vs. number of vectors for various values of 5t for the Minimal 
Residual Extrapolation method. 
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Figure 5: Contour plot of number of CG steps. 
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Figure 6: CT vs. number of vectors for various values of St for the Minimal Residual 
Extrapolation method. 
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Figure 8: Number of CG steps vs. number of vectors for various values of St for the Polynomial 
extrapolation. 
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Figure 9: Starting residue vs. number of vectors for various values of 8t for the polynomial 
extrapolation. 
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Figure 11: CT vs. Number of vectors for various values of St for the Polynomial extrapolation. 
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Figure 13: Number of CG steps vs. number of vectors for various values of St for the Projective 
CG. 
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Figure 14: Starting residue vs. number of vectors for various values of St for the Projective 
CG. 
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Figure 15: Contour plot of number of CG steps for the Projective CG. 
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Figure 16: CT vs. Number of vectors for various values of 5t for the Projective CG. 
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Figure 17: Contour plot of CT for the Projective CG. 
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Figure 18: Example of convergence of the CG as a function of the number of extrapola- 
tion vectors for a single lattice and St = 0.010. The method used was Minimal Residual 
Extrapolation and the number of vectors varies from (top line) to 11 (bottom line). 
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